Overview & Plan

This script implements the feature-importance pipeline for the k-prototypes clustering in a descriptive, geometry-first fashion.

Re-clustering and explicit “minimal viable feature set” selection are deferred to a dedicated follow-up script (5_parsimonious_feature_selection.Rmd), where subsets will be evaluated by how well they reproduce the full 17-feature solution (e.g., via an ARI stability threshold). In the current script, all importance summaries (centroid separation, MCR, and SHAP) are treated as continuous characterizations of how features drive the 2-cluster geometry; any high-stringency “keep” sets are used for sensitivity descriptions rather than as the sole definition of “important” features.
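As a reference for how the follow-up script could score a candidate subset against the full solution, here is a minimal base-R sketch of the adjusted Rand index (the actual implementation and threshold live in 5_parsimonious_feature_selection.Rmd):

```r
# Adjusted Rand index between two label vectors (base R, no packages).
# ARI = 1 for identical partitions (up to label permutation), ~0 for random.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))            # agreeing pairs within cells
  sum_a  <- sum(choose(rowSums(tab), 2))   # pairs within partition a
  sum_b  <- sum(choose(colSums(tab), 2))   # pairs within partition b
  expected  <- sum_a * sum_b / choose(n, 2)
  max_index <- (sum_a + sum_b) / 2
  (sum_ij - expected) / (max_index - expected)
}
```

Because ARI is invariant to label swapping, `adjusted_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))` returns 1.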

1. Setup & Environment

Goal: Load packages, create output paths, configure parallel workers from SLURM, initialize logging, and generate a reproducible seed list shared across array tasks. Also define IO helper(s).
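A sketch of this setup step with illustrative names (the master seed, seed count, and helper name are assumptions, not the script's actual values):

```r
# Worker count from SLURM, falling back to 1 for local runs
n_workers <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

set.seed(20250101)                       # illustrative master seed
seed_list <- sample.int(1e6, size = 50)  # one seed per array task, shared by all tasks

save_csv <- function(dt, path) {         # minimal IO helper
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  write.csv(dt, path, row.names = FALSE)
}
```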

Set-up beginning
Set-up complete

2. Data & Fitted Clustering Anchor

Goal: Load the cleaned baseline risk dataframe and the final fitted k-prototypes object (scaling currently = z_score, lambda from lambdaest). Compute baseline cluster assignments from the anchor model and coerce risk_dt column types to match those of the fitted object, ensuring consistent distance calculations in downstream predictions.

Loaded k-prototypes (from list): k=2, lambda=3.31568
Sanity: predict(kp0, kp0$data) agrees with kp0$cluster in 100.0% of rows.
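The coercion and agreement logic can be sketched as below; `match_types()` and `label_agreement()` are illustrative stand-ins, while the real check runs `predict(kp0, kp0$data)` against `kp0$cluster`:

```r
# Coerce columns of new_dt to the types (and factor levels) of ref_dt,
# so mixed-type distances match those used when the model was fitted.
match_types <- function(new_dt, ref_dt) {
  for (nm in intersect(names(new_dt), names(ref_dt))) {
    if (is.factor(ref_dt[[nm]])) {
      new_dt[[nm]] <- factor(new_dt[[nm]], levels = levels(ref_dt[[nm]]))
    } else if (is.numeric(ref_dt[[nm]])) {
      new_dt[[nm]] <- as.numeric(new_dt[[nm]])
    }
  }
  new_dt
}

# Fraction of rows where predicted labels agree with stored labels
label_agreement <- function(pred, stored) mean(pred == stored)
```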

3. Helper Functions

Goal: Implement the following reusable helpers:

  • permute_col() - type-preserving permutation of a given variable used for cluster reassignment
  • feature_mcr() - permutation MCR vs a base assignment for a single feature
  • shadow_baseline() - global shadow baseline per seed
  • shadow_baseline_feat() - shadow baseline for each feature
  • compute_redundancy() - |rho| and Cramer’s V flags for redundancy among retained features
  • centroid_profiles() - descriptive numeric/categorical summaries for the centroid of each cluster
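Minimal sketches of the first two helpers, with assumed signatures (the real versions likely handle more edge cases):

```r
# permute_col(): type-preserving permutation of one column.
# sample() on a vector or factor keeps its class, so factors stay factors.
permute_col <- function(dt, col) {
  dt[[col]] <- sample(dt[[col]])
  dt
}

# feature_mcr(): misclassification rate of reassigned labels vs a base
# assignment after permuting one feature, averaged over biter permutations.
# reassign_fun maps a data frame to a cluster-label vector.
feature_mcr <- function(dt, col, base_assign, reassign_fun, biter = 10) {
  mcrs <- replicate(biter, {
    perm <- permute_col(dt, col)
    mean(reassign_fun(perm) != base_assign)
  })
  mean(mcrs)
}
```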

4. Descriptive Centroid Profiles

Goal: Compute and save descriptive centroid profiles (numeric means/SDs and categorical modes/proportions) for reference, then summarise geometry-based separation and add collinearity and visual diagnostics.

4a. Centroid Geometry

Goal: Quantify how far apart the clusters sit along each numeric feature in the z-scored space (delta-z, Cohen’s d, and ANOVA p-values) and rank features by geometry-based separation to inform later visualization and interpretation.
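The per-feature geometry summary can be sketched in base R (function names are illustrative; features are assumed already z-scored, and exactly two clusters are assumed):

```r
# Gap between two cluster means on the z scale, plus pooled-SD Cohen's d
centroid_gap <- function(x, cluster) {
  m <- tapply(x, cluster, mean)
  s <- tapply(x, cluster, sd)
  n <- tapply(x, cluster, length)
  delta_z <- unname(m[2] - m[1])   # mean difference (already z-scored)
  sd_pool <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (sum(n) - 2))
  c(delta_z = delta_z, cohens_d = unname(delta_z / sd_pool))
}

# One-way ANOVA p-value for a numeric feature across clusters
anova_p <- function(x, cluster) {
  summary(aov(x ~ factor(cluster)))[[1]][["Pr(>F)"]][1]
}
```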

4b. Correlation Diagnostics

Goal: Characterize the correlation/collinearity structure among the 17 risk features, highlighting correlated “blocks” (e.g., symptom burden) that may behave interchangeably in permutation and SHAP analyses.
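Sketches of the two redundancy measures (the 0.70 cutoff is an illustrative assumption, not necessarily the script's):

```r
# cramers_v(): bias-uncorrected Cramer's V for a categorical pair
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# high_cor_pairs(): flag numeric feature pairs with |rho| above a cutoff
high_cor_pairs <- function(num_dt, cutoff = 0.70) {
  cm <- abs(cor(num_dt, use = "pairwise.complete.obs"))
  which(cm > cutoff & upper.tri(cm), arr.ind = TRUE)
}
```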


4c. Visual Separation Diagnostics

Goal: Provide a visual sanity check that clusters separate in 1D/2D space along the top numeric features, using univariate densities, bivariate scatterplots with centroids, and a PCA view to complement the centroid and MCR rankings.
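A sketch of the PCA portion of these diagnostics; `pca_scores()` is an illustrative helper that computes the 2D coordinates such a view would plot:

```r
# Project the numeric feature block onto the first two principal
# components and attach cluster labels for plotting.
pca_scores <- function(num_dt, cluster) {
  pc <- prcomp(num_dt, center = TRUE, scale. = TRUE)
  data.frame(PC1 = pc$x[, 1], PC2 = pc$x[, 2], cluster = factor(cluster))
}
```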

5. Seed-Level Importance Array Task

Goal: For a given seed, re-fit k-prototypes (same k, lambda, nstart), compute per-feature MCR with biter permutations, evaluate the shadow-max threshold, and emit a per-seed CSV. These are treated as continuous descriptive measures of how sensitive the cluster labels are to perturbations of each feature, not as a sole keep/drop gate for the risk features.
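One plausible form of the per-seed shadow-max comparison (the function name and signature are assumptions): permute every feature at once to produce "shadow" data, record the resulting MCR over B repeats, and use the maximum as the null threshold a real feature's MCR must exceed.

```r
# Maximum MCR obtained from B fully permuted shadow copies of the data;
# reassign_fun maps a data frame to cluster labels.
shadow_max_mcr <- function(dt, base_assign, reassign_fun, cols, B = 5) {
  max(replicate(B, {
    shadow <- dt
    shadow[cols] <- lapply(dt[cols], sample)  # break all feature structure
    mean(reassign_fun(shadow) != base_assign)
  }))
}
```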

6. Dispatch Seed Array Job

Goal: Run a single-seed task when launched as a SLURM array job, or iterate all seeds locally when testing without SLURM. Array tasks exit early after writing their outputs.
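The dispatch logic can be sketched as a small function; `run_seed_task` and `seed_list` are placeholders supplied by earlier sections:

```r
# Run one seed if SLURM provides an array index, else all seeds locally.
dispatch <- function(seed_list, run_seed_task,
                     task_id = Sys.getenv("SLURM_ARRAY_TASK_ID", unset = NA)) {
  if (!is.na(task_id)) {
    run_seed_task(seed_list[as.integer(task_id)])  # single array task
  } else {
    for (s in seed_list) run_seed_task(s)          # local fallback
  }
  invisible(NULL)
}
```

In the actual script the array branch would also `quit(save = "no")` after writing its outputs, matching the early-exit behaviour described above.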

7. Aggregate & Descriptive MCR Summary

Goal: Aggregate per-seed outputs to compute mean MCR, SD, and win-rate (fraction of seeds where MCR exceeds feature-specific shadow thresholds). Treat these as continuous rankings of how sensitive the clustering solution is to perturbations in each feature. A high-stringency “keep” subset (stability + FDR) is still derived but used primarily for sensitivity descriptions, not as the sole definition of “important” features.
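A base-R sketch of the aggregation, assuming the stacked per-seed table has one row per (seed, feature) with `mcr` and `shadow_thr` columns (column names are assumptions):

```r
# Summarise per-seed MCR rows into mean, SD, and win-rate per feature,
# ranked by mean MCR.
aggregate_mcr <- function(seed_dt) {
  by_feat <- split(seed_dt, seed_dt$feature)
  out <- do.call(rbind, lapply(by_feat, function(d) {
    data.frame(feature  = d$feature[1],
               mean_mcr = mean(d$mcr),
               sd_mcr   = sd(d$mcr),
               win_rate = mean(d$mcr > d$shadow_thr))  # seeds beating shadow
  }))
  out[order(-out$mean_mcr), ]
}
```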

8. Surrogate Triangulation of Cluster-Relevant Features Using XGBoost + SHAP

Goal: Train a regularized XGBoost surrogate to predict cluster labels with stratified v-fold OOF evaluation (balanced accuracy and macro-F1). Fit a final model on all data to compute SHAP global importances for descriptive corroboration of which features contribute most to discriminating the 2 clusters. SHAP is used here exclusively for visualization/triangulation, not as a hard selection criterion.
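A minimal sketch of the SHAP step using xgboost's built-in contribution predictions; the iris-based binary label stands in for the 2-cluster assignment, and the hyperparameters are illustrative, not the script's tuned values:

```r
library(xgboost)

set.seed(1)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species == "setosa")  # stand-in binary cluster label

dtrain <- xgb.DMatrix(X, label = y)
bst <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 3, eta = 0.3),
  data    = dtrain,
  nrounds = 20,
  verbose = 0
)

# predcontrib = TRUE returns per-row SHAP values plus a final BIAS column;
# mean |SHAP| per feature gives the global importance ranking.
contrib  <- predict(bst, dtrain, predcontrib = TRUE)
shap_imp <- colMeans(abs(contrib[, -ncol(contrib), drop = FALSE]))
shap_imp <- sort(shap_imp, decreasing = TRUE)
```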


9. Reproducibility Log

Goal: Record parameters (n, p, k, lambda, scaling, biter, n_seeds, shadow_B, stability_keep) and container hash to run_log.json for traceability of this run.
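A sketch of the logging step, assuming jsonlite is available (field names follow the Goal above; the helper name is illustrative and the container hash would be supplied from the environment):

```r
# Serialize run parameters to run_log.json with a UTC timestamp.
write_run_log <- function(params, path = "run_log.json") {
  params$timestamp_utc <- format(Sys.time(), tz = "UTC", usetz = TRUE)
  writeLines(jsonlite::toJSON(params, auto_unbox = TRUE, pretty = TRUE), path)
  invisible(path)
}
```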